Abstract

An  approach  to  computer  recognition  of  continuous 
speech  through  phoneme  identification  is  presented.  Speech 
data  is  recorded  on  a tape  recorder,  then  digitally  sampled, 
fast  Fourier  transformed  and  logarithmically  compressed  into 
16  frequency  bands.  This  processed  data  is  then  used  in 
running  crosscorrelation,  phoneme  recognition  and  location 
programs.  Once  the  phonemes  are  located  and/or  recognized, 
a ranking  of  possible  phonemes  is  selected.  This  procedure 
was  used  on  four  different  speakers  using  both  continuous 
and  discrete  speech.  Phoneme  averaging  was  used  to  improve 
previous  results  by  nearly  28%.  The  rank  ordering  and  new 
decision  scheme  improved  recognition  by  47%.  The  final 
improved  phoneme  location  and  recognition  rates  were  76.9% 
and  72.0%  on  dissimilar  speakers. 




COMPUTER  IDENTIFICATION 
OF  PHONEMES 
IN  CONTINUOUS  SPEECH 


I. Introduction


This  paper  is  a continuation  of  work  begun  by 
Major  Ralph  W.  Neyman  and  improved  by  Captain  William  R. 
Hensley  on  the  problem  of  machine  based  speech  recognition. 
The  advantages  of  a free-speech  input  to  computing  machinery 
are  widely  recognized  and  have  been  the  basis  of  extensive 
research  projects  by  many  groups  around  the  world 
(Ref  1:319).  Present  literature  expresses  the  opinion 
that  a true  continuous  speech  recognition  system  is  still 
years  in  the  future,  and  even  then  the  systems  may  be  highly 
restrictive  (Ref  10:531).  The  preliminary  results  of  Neyman 
and  Hensley  seem  to  contradict  this  belief  and  are  the  basis 
for  this  continued  research. 

Motivation 

The past few years have brought various degrees of
success in computer speech recognition systems. Some systems
that recognize limited vocabularies quite accurately are
presently on the market and are listed in Table I.

Some of these limited recognition systems have found
practical use, and one is available in kit form for the
hobbyist. One practical system is used by paraplegics
for voice control of their wheelchairs (Ref 11:346). The
wheelchair device is only an eight-word system, but it does
point to the potential of an unrestricted speech understanding
machine. In this case the only mode of communication left for
the man/machine system was the voice. This, however, is not
the only reason for desiring a voice recognition system. "Even
with the best communication aids of high technology, speech
remains unrivaled as the fastest and most convenient way for
human beings to communicate interactively" (Ref 13:40). As an
example, speech transfers information at approximately twice
the rate possible for a good typist. Speech surpasses written
or keyboard-oriented inputs with respect to speed and ease of
information transfer. Consequently, speech input is the
natural and most desirable form of man/machine interface.

Table II lists a few of the possible applications of speech
recognition. It is by no means an all-inclusive list, but it
does point out some of the possible military tasks which could
be automated were a reliable speech recognition system
available.

Objective 

The overall objective of this project was to improve
and change as necessary the Neyman/Hensley recognition and
location scheme (Ref 5). The essence of the program, both as
it existed and after it was highly modified, was the analysis
of spectral information.

This  approach  to  the  speech  recognition  problem  has 
inherent  limitations.  To  completely  recognize  speech  more 

Table II

Military  Tasks  for  Possible  Automation 
1)  Security

1.1  Speaker  Verification  (Authentication) 

1.2  Speaker  Identification  (Recognition) 

1.3  Determining  emotional  state  of  speaker  (e.g.,  stress 
effects) 

1.4  Recognition  of  spoken  codes 

1.5  Secure  access  voice  identification,  whether  or  not  in 
combination  with  fingerprints,  facial  information, 
identity  card,  signature,  etc. 

1.6  Surveillance  of  communication  channels 

2)  Command  and  Control 

2.1  System control (ships, aircraft, fire control, situation
displays, etc.)

2.2  Voice-operated  computer  input/output  (each  telephone 
a terminal) 

2.3  Data  handling  and  record  control 

2.4  Material  handling  (mail,  baggage,  publications, 
industrial  applications) 

2.5  Remote  control  (dangerous  material) 

2.6  Administrative  record  control 

3)  Data  Transmission  and  Communication 

3.1  Speech  synthesis 

3.2  Vocoder  systems 

3.3  Bandwidth  reduction  or,  more  general,  bit-rate 
reduction 

3.4  Ciphering/coding/scrambling 

4)  Processing Distorted Speech

4.1  Diver  speech 

4.2  Astronaut  communication 

4.3  Underwater  telephone 

4.4  Oxygen  mask  speech 

4.5  High  "G"  force  speech 


(Ref 1:310)


Automatic speech recognition, as the human accomplishes it,
will probably be possible only through the proper analysis
and application of grammatical, contextual, and semantic
constraints. This approach also presumes an acoustic
analysis which preserves the same information that the
human transducer (i.e., the ear) does. It is clear, too,
that for a given accuracy of recognition, a trade can be
made between the necessary linguistic constraints, and
complexity of vocabulary, and the number of speakers
(Ref 4:163).

In realization of the above, the overall goal of this
project was to improve the location and recognition of
phonemes by use of spectral information only. Rank ordering
of the possible phonemes by the decision scheme makes
possible the future addition of a linguistic/syntax program.


The  desired  result  of  this  investigation  was  an 
improved  phoneme  recognition  and  location  scheme  for  the 
analysis  of  discrete  and  continuous  speech  by  dissimilar 
speakers.  Four  speakers  were  chosen  and  all  recorded  two 
sentences.  Each  sentence  was  first  spoken  discretely  then 
continuously.  These  sentences  were  then  used  for  program 
analysis.

To  establish  a base  line,  the  original  program  was 
used  to  generate  location  and  identification  percentages  of 
a 15  class  phoneme  set.  This  set  was  generated  by  speaker 
number  one  on  his  first  discrete  sentence.  A learning 
scheme was then introduced to investigate the effects of a
"fluid" template (phoneme) set; that is, a template set that
would represent not one particular speaker but the average
from a selected group of speakers. To evaluate this idea,
the first two speakers were used to generate the averaged
set from both their discrete and continuous sentences. The
other two speakers were also evaluated using these modified
prototypes.

Following the initial performance evaluation and subsequent
change evaluations, the phoneme set was expanded to
26 templates. This new set of phonemes was used to evaluate
a new decision scheme. The averaging program was used on
the first two speakers while all four speakers were used in
data collection. The recognition rate was then compared
against the original program recognition rate. The new
decision scheme also rank orders the top five possible
phoneme choices.

To aid in evaluating the results of the crosscorrelation
program, a graph routine was written. This program
displayed  the  phoneme/sentence  crosscorrelation  data  in 
graph  format,  plotting  amplitude  against  time.  This  gives 
a visual  display  of  how  each  phoneme  correlates  with  the 
sentence  being  evaluated. 
obtain the required speech input data in a form which is
roughly  analogous  to  that  received  by  the  human  ear.  Since 
the  basic  function  of  the  outer  ear  system  is  to  transform 
the  pressure  variations  of  sound  into  a format  which  can 
be  used  by  the  frequency  analysis  portions  of  the  middle 
ear,  the  initial  portion  of  data  acquisition  mimics  this 
function  through  the  use  of  an  audio  tape  recorder. 

Basic  data  acquisition  consists  of  speaking  the  desired 
sentence  into  one  channel  of  a reel-to-reel  stereo  tape 
recorder. Tone "markers" of 2 kHz are spaced at ten-second
intervals  on  the  second  channel.  These  tones  serve  as  a 
calibration  system  which  allows  personnel  operating  the 
preanalysis  programs  to  have  a means  of  knowing  the  location 
of  each  input  sentence.  Figure  1 illustrates  the  overall 
layout  of  the  initial  data  acquisition  scheme. 
Figure 1. Initial Data Acquisition


III. Data Preprocessing


Initial  work  done  in  the  area  of  speech  recognition 
involved  the  use  of  a vocoder  to  simulate  the  actions  of 
the  inner  ear  system.  The  inner  ear  serves  to  accept  the 
pressure  changing  inputs  of  the  outer  ear  as  its  input.  It 
outputs  the  speech  data  in  the  form  of  a running  frequency 
and  amplitude  analysis  of  its  input.  The  vocoder  served  to 
effectively  simulate  this  action  of  the  ear  by  operating  as 
a series  of  matched  filters  which  gave  an  indication  of  the 
amplitude  of  the  output  of  each  filter  at  a particular  time. 
A running  analysis  of  the  various  filter  outputs  would  then 
perform  functions  which  could  mimic  some  operations  believed 
to  occur  in  the  brain. 

A simple block diagram of a vocoder layout is represented
in Figure 2. Note that the analog outputs have to
be  converted  to  a digital  format  and  then  each  digital  word 
would  have  to  be  recorded  in  some  sequence  for  further 
analysis  by  the  information  processing  program. 

Neyman found that operation of the vocoders available
at that time presented problems with both accessibility and
reliability (Ref 8). He chose to initiate an analysis program
which would imitate the vocoder through the use of Fast
Fourier Transform (FFT) techniques.

Figure 2. Block Diagram of Vocoder


Vocoder  Simulation 


The  stages  of  speech  signal  preprocessing  described 
below  are  accomplished  by  the  Analog/Hybrid  Systems  Branch 
of  the  ASD  Computer  Center. 

Amplification. The first stage of the preliminary
signal preprocessing consists of amplifying the speech
signals to a level which can be used by the analog-to-digital
converters of the Comcor ci-5000/6 analog computer. The
amplifiers contained within this machine are usable only to
2.5 kHz. Since high quality speech data contains frequencies
close to 5 kHz, it was necessary to compensate for the
reduced amplifier response. To do this, the input speech
tape was played at a speed of 3 3/4 inches per second, half
its recorded speed, which halves every frequency on the
tape. The resulting audio signal was low-pass filtered to
ensure a maximum input frequency of 2.5 kHz. The input signal
was then sampled at twice this rate (5 kHz) in order to
satisfy the Nyquist sampling requirement. Note that this
overall procedure is equivalent to playing the tape at its
originally recorded speed of 7 1/2 inches per second,
filtering the tape to eliminate signals above 5 kHz, and
sampling the filtered output at 10 kHz.
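
The equivalence stated above is simple scaling arithmetic and can
be checked with a short sketch. The function name effective_rates
and its parameters are illustrative assumptions, not part of the
original programs.

```python
def effective_rates(sample_rate_hz, cutoff_hz, playback_ips, recorded_ips):
    """Map a sampling rate and filter cutoff applied during slowed
    playback back to the time scale of the original recording."""
    # Playing the tape slower stretches time by this factor, so every
    # frequency measured during playback is scaled up by the same factor.
    speedup = recorded_ips / playback_ips
    return sample_rate_hz * speedup, cutoff_hz * speedup


# Values from the text: 5 kHz sampling and a 2.5 kHz filter applied to a
# tape recorded at 7 1/2 ips and played back at 3 3/4 ips.
rate_hz, bandwidth_hz = effective_rates(5000.0, 2500.0, 3.75, 7.5)
print(rate_hz, bandwidth_hz)
```

With the values quoted in the text, the effective sampling rate
remains twice the effective bandwidth, so the Nyquist requirement
is met in either description.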

The  sampled  signals  were  then  boosted  to  100  volts  to 
allow  accurate  sampling  by  the  11-bit  analog-to-digital 
(A/D)  converters.  The  output  of  the  A/D  converters  was  a 
binary  representation  of  a four  digit  number,  and  described 
the  amplitude  of  the  analog  voltage  output  at  a particular 
time. These numbers served as a digital representation of
the  time  varying  audio  signal  and  were  used  as  input  to 
the  frequency  analysis  portion  of  data  preprocessing. 

Frequency Analysis. Ever since the techniques of
Discrete Fourier Transform computation were perfected in
the late 1960s, the use of the Fast Fourier Transform
(FFT) to compute a frequency analysis of time-varying data
has been well known and documented (Ref 2:41-52). It is
this property of the FFT which is used in the frequency
analysis portion of the signal preprocessing. The technique
of analysis is implemented as follows: The incoming digital
representations of the analog signals are taken in sets of
128 and used as a 1 x 128 input array to the FFT algorithm.
Since each sample represents the analog output at intervals
of 10^-4 seconds, 128 of these samples represent a total
elapsed time of 12.8 x 10^-3 seconds. The algorithm computes
the Discrete Fourier Transform (DFT) of this time series and
returns the amplitude of each complex number in the
frequency spectrum.

These  frequency  amplitudes  represent  the  frequency 
spectrum  of  the  input  time  signal.  Each  point  in  the  FFT 
array represents an integral multiple of 78.125 Hz
(10 kHz/128). Since the input time series x(t) is composed
of only real numbers, the real part of the Fourier
transformed series X(w) is symmetric about the folding
frequency (one-half the sampling frequency). The magnitudes
of the Fourier transformed series are also symmetrical
about this folding frequency. The final result is that,
although  128  samples  are  taken,  and  a 128  point  DFT  is  done 
on  the  time  series,  only  the  first  64  of  the  resulting 
transformed  values  are  used  to  represent  the  frequency 
spectrum  of  that  portion  of  the  analog  signal. 

Figure 4 illustrates pictorially the application
of the FFT techniques to the input signal. The FFT
transformations  are  done  on  a special  purpose  Xerox  Digital 
computer  operated  in  conjunction  with  the  Comcor  computer 
mentioned  previously.  The  actual  program  which  produces  the 
desired  results  is  called  AMPSPC  and  is  implemented  by  the 
Analog/Hybrid  Systems  Branch,  ASD. 
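
The framing and transform step can be sketched as follows. This
is a minimal illustration and not the AMPSPC program itself: the
name frame_spectrum is hypothetical, and a direct DFT is used in
place of an FFT for brevity (an FFT returns the same values with
less computation).

```python
import cmath
import math

SAMPLE_RATE_HZ = 10_000  # effective sampling rate described in the text
FRAME_LEN = 128          # samples per transform (12.8 ms at 10 kHz)


def frame_spectrum(frame):
    """Return the magnitudes of the first 64 DFT values of a 128-sample frame."""
    n = len(frame)
    if n != FRAME_LEN:
        raise ValueError("expected exactly 128 samples")
    mags = []
    # Only the first half is computed: for real input, the magnitudes
    # above the folding frequency mirror those below it.
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def bin_frequency_hz(k):
    """Frequency represented by DFT value k (multiples of 78.125 Hz)."""
    return k * SAMPLE_RATE_HZ / FRAME_LEN


# Illustrative input: a pure tone placed exactly on the eighth DFT value.
tone = [math.cos(2 * math.pi * 8 * t / FRAME_LEN) for t in range(FRAME_LEN)]
spectrum = frame_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin, bin_frequency_hz(peak_bin))
```

The test tone is synthetic and serves only to show that a single
frequency component appears at the expected multiple of 78.125 Hz.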
Data Storage. Since each 12.8 x 10^-3 seconds of speech
data will incur a storage requirement of 64 "sets" of
numbers, it is obvious that a two or three second segment of
speech can involve the generation of over 1.5 x 10^4 data
points. In addition, the requirements of the correlation
programs for varied input data for subsequent analysis
make it necessary to prepare data for more than one
sentence. It is thus necessary to place the frequency
analysis data in a form which is easily stored and is still
accessible to the subsequent stages of analysis.

The format used to store all speech data after the
frequency analysis stage is one which is compatible with the
input capabilities of the Cyber/6600 computer. The data is
thus "written" on a computer library tape which is stored in
the ASD Computer Center for access by subsequent programs
(in particular, GS1). Once the original audio signals have
been transformed into frequency analysis data and written
onto a computer library tape (L-tape), the data preprocessing
phase is complete.

Figure 4. Frequency Spectrum Using FFT
After the analog signal has been digitized and written
on tape in the manner described in section three, the digital
records are in a form which is usable by subsequent
processing stages. The data, as it now exists, is in the
form of a digitized output of 64 discrete audio filters,
each having a center frequency of some integral multiple
of 78.125 Hz. Each number represents the averaged output
of a particular filter over an interval of 12.8 milliseconds.

Channel  Reduction  and  Equalization 

Due  to  the  fact  that  the  ear-brain  system  seems  to 
respond  to  ratios  of  frequencies  rather  than  absolute 
frequency  values,  and  since  previous  work  with  vocoders 
seemed  to  indicate  that  fewer  frequency  filters  still 
yielded  intelligible  information,  the  64  original  channels 
were  reduced  in  a manner  which  would  simulate  the  ear's 
actual  sensitivity  to  frequency  changes  (Ref  5:85). 

The 64 channels are reduced in the following manner.
The first six channels, representing output center
frequencies from approximately 78 to 470 Hz, are left
unchanged. The remaining 58 channels are separated into
approximately 1/3 octave groups. The channels within each
group are then added (thus weighting the values at the high
end of the frequency scale) and combined into a total of 10
additional channels. The final result is 16 channels with
center frequencies as given in Table III.
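
The reduction from 64 to 16 channels can be sketched as below.
The actual group boundaries are those of Table III, which is not
reproduced here; the geometrically spaced boundaries produced by
third_octave_edges are therefore an assumption chosen only to
give ten roughly 1/3 octave groups, and both function names are
hypothetical.

```python
def third_octave_edges(first=7, last=64, groups=10):
    """Geometrically spaced channel boundaries (1-based, end exclusive).

    Assumed spacing for illustration; not the boundaries of Table III.
    """
    ratio = ((last + 1) / first) ** (1.0 / groups)
    return [round(first * ratio ** k) for k in range(groups + 1)]


def reduce_channels(frame64):
    """Reduce one 64-channel frame to 16 channels."""
    if len(frame64) != 64:
        raise ValueError("expected 64 channels")
    # Channels 1 through 6 (about 78 to 470 Hz) pass through unchanged.
    kept = list(frame64[:6])
    # Channels 7 through 64 are summed within each group, so the wider
    # high-frequency groups receive proportionally more weight.
    edges = third_octave_edges()
    summed = [sum(frame64[lo - 1:hi - 1]) for lo, hi in zip(edges, edges[1:])]
    return kept + summed


# Illustrative input: a flat spectrum makes the group widths visible.
flat = [1.0] * 64
reduced = reduce_channels(flat)
print(reduced)
```

With a flat input, each of the ten summed channels equals the
number of original channels in its group, which shows directly
how summation emphasizes the higher frequencies.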

The overall effect of this channel reduction is somewhat
analogous to a phono equalization or preemphasis curve.

Through the use of this preemphasis, the high frequency
information is not lost in comparison to the much greater
amplitudes of the low frequency information. The actual
frequency equalization curves for the vocoder and the
digital simulation are shown in Figure 5. Note that the
curves are roughly similar, with the higher frequencies
being given slightly more emphasis on the digital simulation
scheme.

Visual  Output  (Spectrogram) 

Once the speech data is in a form which can be handled
by the analysis portions of the recognition scheme, it is
helpful to look at the output in a form which makes possible
a visual analysis. Much work has been done in this area by
Potter, Kopp, and Green (Ref 9). They found that there
seemed to be enough visual clues in a time-frequency
spectrogram to allow trained personnel to do a remarkably
accurate job of interpreting the original speech. Thus,
the next step is to transform our speech data into a form
which would resemble a speech spectrogram. The method used
involves the implementation of a two-dimensional printing
scheme which plots the output of each frequency channel on
one axis and the time of the occurrence on the other axis.
The processing program has been devised so that it will
present both the numerical magnitude of the output of each
frequency channel as well as a pictorial representation of
the combined sixteen channel outputs. The pictorial
representation uses an overprint arrangement which causes
each channel to become increasingly dark as the channel
amplitude increases. Figure 6 gives an example of a specific
speech occurrence along with its representative spectrogram.
The speech spectrograms obtained in this manner closely mimic
the frequency-time spectrograms mentioned by Potter, Kopp,
and Green.
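
A rough modern equivalent of the overprint display is sketched
below. A ramp of progressively denser characters stands in for
overprinting on a line printer; the ramp SHADES, the function
name render_spectrogram, and the choice to scale against the
largest value in the sentence are all assumptions made for
illustration.

```python
SHADES = " .:-=+*#%@"  # light to dark; stands in for overprinted characters


def render_spectrogram(columns):
    """Render 16-channel frames as text rows.

    columns: list of 16-element frames, one per 12.8 ms interval, with
    non-negative amplitudes. Rows run from the highest channel down to
    the lowest; time runs left to right.
    """
    peak = max(max(col) for col in columns) or 1.0
    rows = []
    for ch in reversed(range(16)):
        row = "".join(
            SHADES[min(len(SHADES) - 1, int(col[ch] / peak * len(SHADES)))]
            for col in columns
        )
        rows.append(row)
    return rows


# Illustrative data: a silent frame followed by a frame whose amplitude
# rises with channel number.
demo = [[0.0] * 16, [float(i) for i in range(16)]]
rows = render_spectrogram(demo)
for line in rows:
    print(line)
```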

The  computer  program  which  performs  the  frequency 
reduction,  equalization,  and  spectrogram  output  is  listed 
as program OCTAVE1 (GS2) in the appendix. In addition,
OCTAVE1 stores the reduced data on permanent files to be
accessed by later stages of the recognition process.

Data  Normalization 

The very nature of speech gathering and digital
representation assures a recognition scheme of having a time
varying input which can fluctuate between wide limits. Even
the same sentence spoken by the same speaker will have
different representations in amplitude, timing, etc. It is
for this reason that data normalization is required.
Previous work done by Neyman (Ref 8) and Hensley (Ref 5) has
emphasized the importance of data normalization. There are
two types of normalization which will be used in this paper:
column normalization and unit normalization. The column
normalization procedure is discussed below. Due to
other program considerations, the unit normalization will
be accomplished during the correlation section.

Figure 6. Digital Speech Spectrogram

Column  Normalization.  Following  the  completion  of  the 
data reduction program mentioned previously (OCTAVE1), the
data  is  available  as  a 16  x M array,  where  M is  the  length 
of  the  sentence  in  12.8  msec  intervals.  The  output  of  each 
"filter"  at  each  time  increment  is  a column  of  16  numbers. 
The column normalization procedure consists of replacing
each element e_i in the column with e_i/E, where E is
computed as follows:

        E  =  SUM (i = 1 to 16)  [summand not legible]


What  has  been  done  is  that  the  "energy"  of  each  column  has 
been  found,  and  each  element  has  been  divided  by  that 
energy.  This  procedure  serves  as  a type  of  automatic  gain 
control and ensures that the energy of each column is equal
to  one.  The  fact  that  each  column  has  an  energy  of  one  will 
be  of  great  importance  later  in  arriving  at  normalizing 
values  for  the  correlation  vectors. 
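
Column normalization can be sketched as below. The defining sum
for E is not fully legible in this copy, so the sketch assumes
that E is the square root of the sum of the squared elements,
which makes the sum of squares of each normalized column equal
to one. That choice, and the name column_normalize, are
assumptions rather than a transcription of OCTAVE2.

```python
import math


def column_normalize(column):
    """Scale one 16-element column so that its sum of squares is one.

    Assumption: the column "energy" E is taken as the Euclidean norm.
    """
    energy = math.sqrt(sum(e * e for e in column))
    if energy == 0.0:
        # A silent column has no energy to normalize; return it unchanged.
        return list(column)
    return [e / energy for e in column]


# Illustrative column with only two active channels.
column = [3.0, 4.0] + [0.0] * 14
normalized = column_normalize(column)
print(normalized[:2], sum(e * e for e in normalized))
```

Because every column is scaled independently, a loud frame and a
quiet frame with the same spectral shape produce the same
normalized column, which is the automatic gain control effect
described above.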

The program which implements this normalization procedure
is called OCTAVE2 (GS3) and is listed in the appendix.
Note that OCTAVE2 uses the files created by OCTAVE1.
This program serves only to provide a visual output of the
frequency spectrograms. The correlation program also will
use the files created by OCTAVE1 and will column normalize
the data for its own use. The data was stored in unnormalized
form to allow the option of normalization in the correlation
phase. Figure 7 illustrates the increased clarity
which column normalization provides in the spectrogram
outputs.

Data  Base 

Due  to  the  fact  that  the  preprocessing  and  processing 
phases  are  quite  lengthy  and  can  take  weeks  to  implement,  it 
was  decided  to  collect  speech  data  in  quantities  large  and 
varied  enough  to  serve  as  both  control  and  test  data 
throughout the research. The thrust of this investigation
was to:

1.  Establish  a "baseline"  performance  using  unmodified 
recognition  schemes  devised  by  previous  experiments; 

2.  Investigate  a different  procedure  for  the  creation 
of  prototype  "templates"  used  in  the  correlation  phase; 

3.  Investigate the reduction in performance of speech
recognition programs caused by multiple speakers and widely
varying spacing between words; and

4.  Design  a modified  correlation  and  recognition 
scheme  which  would  have  increased  versatility. 

The decision was made to use test sentences which were
spoken in two different modes, discrete and continuous
speech. Discrete speech is a sentence in which each word is
pronounced carefully and distinctly, with definite "dead"
space between words. Continuous speech, on the other hand,
could best be described as the sentence spoken in a natural
manner with no allowances made for machine recognition.

Figure 7. Normalized vs Non-normalized Spectrograms


Sentence  Preparation 

Since the production of a "complete" set of phonemes
representative of the English language was not a goal of this
paper, it was decided to use two sentences which seemed to
represent a reasonable cross-section of phoneme-like sounds.
The specific sentences were chosen so as to have combinations
of sounds which would present the correlation and
recognition phases with a realistic and hopefully complex
set of data. The two sentences chosen were:

1.  "Kirk  here,  beam  me  up,  Scotty." 

2.  "Quoth  the  Raven,  nevermore." 

Four male speakers were chosen to recite the two sentences
above in the following manner:

1.  Sentence  one  - discrete  speech 

2.  Sentence  one  - continuous  speech 

3.  Sentence  two  - discrete  speech 

4.  Sentence two - continuous speech

Each  speaker  spoke  a total  of  four  sentences.  The 
combined  data  base  consisted  of  sixteen  sentences.  All  were 
recorded  and  processed  as  described  previously  and  were 
stored  in  the  16  channel  reduced  form. 

Phoneme  Analysis 

The two sentences used were then analyzed for phonemic
content. The following analysis of the sentences is not
meant to be definitive, but is designed to serve as the
basis for obtaining phonemic patterns which would serve as
an adequate data base for subsequent portions of the
recognition scheme.

Sentence one was found to contain 15 phonemes. The
beginning and ending sounds of "K" from Kirk and "M" from
me and beam were given separate phonemic representations to
allow further research into the differences of these
allophones. The sentence and its phonemic representation used
by subsequent programs are listed below. The subscripts
following the K and M are the results of the differentiation
between beginning (B) and ending (E) phonemes.

Sentence  One 

"Kirk  here,  beam  me  up,  Scotty." 

Phonemic  Representation 

K_B UR K_E H I R B IE M_E M_B IE U P S K AH T IE

Sentence two contained 11 additional phonemes. Sentence two
and its phonemic representation are listed below.

Sentence  Two 

"Quoth  the  raven,  nevermore." 

Phonemic  Representation 


QU OO T_E T_B E R A V EN N EE V ER M_B OO ER

These analyses are admittedly arbitrary in differentiation
of individual phonemic characters. Indeed, a true
professional analysis of two persons speaking the same
sentence would stand a good chance of yielding slightly
different results when analyzed. What is true about all
sentences of the same reading, however, is that virtually
any human versed in English will interpret the sentence in
the same manner. It is this relative nonvariance in
interpretation which is the goal of the recognition portions
of this paper.

Phoneme  Extraction 

Previous research dealing with this method of continuous
speech recognition used the speech inputs of one
speaker to establish a prototype phoneme set which was then
used as a data base for further speech inputs by the same
speaker. In addition, sentences spoken by other speakers
were tried, always with quite poor results (Ref 5). It was
decided to use the techniques employed previously in order
to establish a "base line" performance rating which could
then be used as a reference for performance of the various
procedural and technical modifications which were to be
introduced.

The  first  phoneme  prototype  set  which  was  constructed 
consisted  of  the  fifteen  phoneme-like  sounds  contained  in 
sentence  one.  The  discrete  sentence  spoken  by  the  first 
speaker  was  chosen  as  the  sentence  from  which  the  phonemes 
were  extracted.  The  phonemes  were  taken  from  segments  of 
the  N X 16  array  which  composed  the  stored  results  of  the 
preanalysis  and  reduction  programs. 

The  spectrogram  representation  of  the  sentence  was 
first  visually  analyzed  for  the  possible  locations  of  the 
target  phonemes.  The  phoneme  positions  were  noted  and  a 
program  was  implemented  which  extracted  the  desired  portions 


arrays which were to be used by the correlation program.

Pictorially, the phoneme extraction process is represented
in Figure 8.


Note  that  this  selection  program  is  universal  in  the  sense 
that  it  can  be  used  to  extract  and  store  any  number  of 
portions  of  any  number  of  sentences.  It  is,  in  fact,  used 
in  a modified  manner  to  produce  the  "fluid"  prototypes 
mentioned in the next section. This program, PROAVG (GS4),
is listed and discussed in Appendix A.
Phoneme Averaging

Following work done with the phonemes obtained from
the first sentence, it was decided to introduce a modified
phoneme extraction process which was designed to take into
account the variability introduced into phonemic structure
by two specific cases:

1.  The same speaker saying the same sentence, but at
a faster rate (discrete vs continuous speaking).

2.  Different speakers saying the same sentence.

Analysis of spectrograms produced by GS3 showed that
the overall spectral content of the sentences was remarkably
similar. Although the discrete spectrograms proved to be
much easier to interpret, the basic speech form of identical
words, even when spoken by different speakers, seemed to be
preserved. Therefore, the concept of prototype averaging
was implemented.

The averaging method involved using the same phoneme
extracting program as before. Following the phoneme
extraction from the sentences of interest, the phonemes
extracted from like words of both continuous and discrete
speech were then averaged point by point. This required that
all prototypes be of identical length. Since lengths varied
somewhat with respect to speaker, this problem was
approached by finding the shortest phoneme length available
for a given phoneme in a given sentence. Representative
portions of like phonemes from other sentences were chosen
to be the same length.

Although the problem of varying phoneme lengths would
seem at first to be quite difficult, analysis of spectrograph
data revealed that the variations in length of both the
discrete and continuous sentences with multiple speakers
were on the order of 20%. Since the largest prototype used
in this research was 16 x 13, this variation in length is
only 30 milliseconds, which presented no great problem.
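
The averaging step described above can be sketched as follows.
This is a minimal illustration, not the PROAVG program: it
assumes each phoneme example is stored as a list of time
columns, each column holding the frequency band values, and it
assumes the "representative portion" is simply the leading
columns. The function name average_prototypes is hypothetical.

```python
def average_prototypes(examples):
    """Average several examples of one phoneme point by point.

    Each example is a list of time columns; each column is a list
    of band values. Examples are first cut to the shortest length
    so that the arrays line up.
    """
    shortest = min(len(ex) for ex in examples)
    # Assumption: keep the leading columns of each example. The text
    # only says a representative portion of equal length was chosen.
    trimmed = [ex[:shortest] for ex in examples]
    bands = len(trimmed[0][0])
    count = len(trimmed)
    return [
        [sum(ex[t][b] for ex in trimmed) / count for b in range(bands)]
        for t in range(shortest)
    ]


# Two toy examples with 2 bands: one 3 columns long, one 2 columns long.
discrete = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
continuous = [[3.0, 4.0], [5.0, 6.0]]
averaged = average_prototypes([discrete, continuous])
```

The averaged prototype has the length of the shortest example,
which matches the equal-length requirement stated above.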

Figure 9 illustrates the procedure used in phoneme
averaging. Tables IV and V list the phonemes used in the
15 class set (baseline data research) and the 26 class
phoneme set used for evaluation of correlation and selection
scheme changes.

Table IV

Prototypes: 15 Class Problem

Phoneme
Representation    Length    Word Taken From

KB                   2      Kirk
UR                   9      Kirk
KE                   2      Kirk
H                    7      Here
I                    8      Here
(illegible)          7      Here
B                    3      Beam
IE                   7      Beam
Me                   5      Beam
Mb                   7      Me
U                    7      Up
P                    3      Up
S                    6      Scotty
AH                  13      Scotty
T                    4      Scotty

NOTE: Both a discrete and an averaged prototype set were used
in the 15 class problem. Lengths were the same.

Table V

Prototypes: 26 Class Problem

Phoneme    Length    Word Taken From

QU            7      Quoth
OO            8      Quoth
Te            7      Quoth
Tb           10      The
E             7      The
A             6      Raven
W             4      Raven
EN            6      Raven
N             6      Never
EE            7      Never
ER            6      Never

NOTE: These are in addition to those listed in Table IV.
These were all averaged prototypes.

V. Recognition Processing

The correlation phase consists of attaching the files
containing the stored prototype data and input speech data,
then performing a running crosscorrelation of each prototype
with the sentence. The output of this correlation
procedure is an N x M array where N is the number of
prototypes contained in the prototype set and M is the time
length of the sentence. The value of each element is the
correlation of that particular prototype with the sentence
at a particular time. While the actual correlation procedure
is relatively simple, various steps have to be taken
in order to prepare both the sentence and the prototypes for
the correlation computations. These steps include
normalization, array augmentation, and DFT operations.
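
For reference, the running crosscorrelation can be written
directly in the time domain. The sketch below is illustrative
only and is not the thesis program, which uses the DFT. It
assumes arrays are lists of time columns, and it assumes that
start times where the prototype would run past the end of the
sentence are filled with zero so that every row has M entries.
The function names are hypothetical.

```python
def running_correlation(prototype, sentence):
    """Slide one prototype along the sentence, one value per start time."""
    p_len, s_len = len(prototype), len(sentence)
    row = []
    for start in range(s_len):
        if start + p_len > s_len:
            # Assumption: pad the tail so each row has s_len entries.
            row.append(0.0)
            continue
        total = 0.0
        for t in range(p_len):
            for a, b in zip(prototype[t], sentence[start + t]):
                total += a * b
        row.append(total)
    return row


def correlation_array(prototypes, sentence):
    """Return the N x M array: one row per prototype, one column per time."""
    return [running_correlation(p, sentence) for p in prototypes]


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
prototypes = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
corr = correlation_array(prototypes, sentence)
```

The direct form needs on the order of P x M multiplications per
prototype and band, which is the cost the DFT approach below is
meant to reduce.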

Normalization 

One aspect which has been emphasized by both Neyman
(Ref 8) and Hensley (Ref 5) was the importance of data
normalization. Section IV dealt with the improvement
attained in the clarity of spectrogram analysis by the
process of column normalization. As stated before, the
data was not stored in a normalized form so that it could
be used in this section in an unchanged form. The
correlation program is arranged so that column normalization is
optional. However, all data processed and analyzed in this
paper used the normalization option. This was done to
minimize the effect of speaker variation. Both prototype
and sentence arrays are column normalized as they are read
from the permanent files. This column normalization
procedure is the only normalization that is done to the
sentence data. The prototype arrays are unit normalized as
well as column normalized, but this additional step occurs
following the DFT. The description of the unit normalization,
along with the reasons for doing it at a later stage, will be
discussed in the section dealing with Fourier filtering.
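
A column normalization of this kind can be sketched as below.
The sketch assumes that "column normalized" means each time
column is scaled to unit energy (sum of squares equal to one),
which is how the Unit Normalization section later describes
it. Leaving all-zero columns unchanged is an assumption made
here to avoid division by zero.

```python
import math


def column_normalize(array):
    """Scale every time column so that its energy (sum of squares) is one."""
    out = []
    for col in array:
        norm = math.sqrt(sum(v * v for v in col))
        if norm == 0.0:
            # Assumption: silent (all-zero) columns are left unchanged.
            out.append(list(col))
        else:
            out.append([v / norm for v in col])
    return out


normalized = column_normalize([[3.0, 4.0], [0.0, 0.0], [0.0, 2.0]])
```

After this step a loud and a quiet rendition of the same
spectral shape produce the same column, which is the speaker
variation effect the normalization is meant to reduce.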

Array  Augmentation 

A major goal of this project was the modification of a
previous correlation program to make possible a
multi-segmented program which would allow greater latitude in
postcorrelation analysis techniques. While actual real-time
correlation techniques are not difficult, they require
such a vast amount of computation that even large scale
processing systems, such as the CDC Cyber 6600, would
require excessive amounts of time to do the computations.

Improvements in recent years in techniques for computing
the DFT of matrices, such as the Fast Fourier Transform (FFT),
have made it possible to use the properties of the Fourier
Transform to greatly reduce the computations needed for
correlation (Ref 1). However, the use of the FFT requires
safeguards against certain problems which are created. The
most important of these problems are aliasing, leakage, and
end effect.

Aliasing is the tendency of a high frequency signal to
"mimic" that of a lower frequency signal if the sample rate
is not sufficiently high. This problem was solved during
the digitization phase by insuring that the sampling rate
(10 kHz) was double the highest allowed voice signal (5 kHz)
and filtering to insure that no frequency above 5 kHz was
retained.
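
The effect can be demonstrated numerically. In the sketch
below the 10 kHz sampling rate is taken from the text, while
the 7 kHz and 3 kHz test tones are arbitrary values chosen
for illustration. A 7 kHz cosine sampled at 10 kHz yields
the same samples as a 3 kHz cosine.

```python
import math


def sample_tone(freq_hz, rate_hz, count):
    """Return `count` samples of a unit cosine at freq_hz sampled at rate_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


RATE_HZ = 10000.0  # sampling rate stated in the text
high = sample_tone(7000.0, RATE_HZ, 16)  # above the 5 kHz limit
low = sample_tone(3000.0, RATE_HZ, 16)   # its alias below the limit
largest_difference = max(abs(h - l) for h, l in zip(high, low))
```

Because the two sample sequences are indistinguishable, any
energy above 5 kHz must be removed by filtering before
sampling, as described above.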

Leakage occurs due to the inherent properties of the
DFT of a finite data record. Truncating the time series
with this "rectangular window" spreads energy between the
frequency components of the DFT in a manner which can alter
the overall results. While this alteration can seriously
affect some types of data, Neyman found that overall results
were not altered by the inclusion of various "window"
functions which were designed to handle the leakage problem
(Ref 8:34). The data used in this report were taken using a
rectangular "window".

End effect occurs as a result of the periodicity
imposed on a function by the use of the DFT. Although
there are times when this duplication of the original
function can be harmless, the very nature of the correlation
calculations requires that a "buffer" be included in the
transformed functions so that the data which is being moved
on the time axis does not run into repeated renditions of
the data to be correlated. The problem can be eliminated
by insuring that the arrays to be transformed are augmented
in the following manner. Let $P_{ij}$ be the prototype array
and $S_{ij}$ be the sentence array. If $P_{ij}$ is of length P
and $S_{ij}$ is of length S, choose an N such that the following
two requirements are met:

$$N > P + S - 1$$
$$N = 2^{n}$$

where n is an integer.

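The augmented arrays are then formed using the following
functions:

$$
P'_{kl} =
\begin{cases}
0, & k = 0, 1, \ldots, N-P; \quad l = 0, 1, \ldots, 15 \\
P_{ij}, & k = N-P+1, N-P+2, \ldots, N-1; \quad l = j = 0, 1, \ldots, 15 \\
0, & k = 0, 1, \ldots, N-1; \quad l = 16, 17, \ldots, 31
\end{cases}
$$

$$
S'_{kl} =
\begin{cases}
S_{ij}, & k = i = 0, 1, \ldots, Q-1; \quad l = j = 0, 1, \ldots, 15 \\
0, & k = Q, Q+1, \ldots, N-1; \quad l = 0, 1, \ldots, 15 \\
0, & k = 0, 1, \ldots, N-1; \quad l = 16, 17, \ldots, 31
\end{cases}
$$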

The array transformation is pictorially illustrated
below.

Figure 10. Augmented Array Diagram (augmented prototype
array and augmented sentence array)

As can be seen in Figure 10, the new augmented arrays,
which are now both 32 x 64 arrays, can be multiplied
point-by-point and yield another 32 x 64 array. In addition, a
visual inspection of the results of a time-domain correlation
will show that the imposed periodicity of a DFT on
both of these truncated functions will not invalidate the
correlation values due to the extra "buffer" space built
into each array.
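
The size selection and zero padding can be sketched as follows.
This is an illustration under stated assumptions, not the
thesis program: data is placed at the start of the padded
array (the original placement within the array may differ),
arrays are lists of time columns, and the helper names
transform_length and augment are hypothetical. The lengths 13
and 50 are example values only.

```python
def transform_length(p_len, s_len):
    """Smallest power of two N with N > p_len + s_len - 1."""
    size = 1
    while size <= p_len + s_len - 1:
        size *= 2
    return size


def augment(array, n_time, n_bands):
    """Zero-pad a list of time columns to n_time columns of n_bands values."""
    padded = [list(col) + [0.0] * (n_bands - len(col)) for col in array]
    padded += [[0.0] * n_bands for _ in range(n_time - len(array))]
    return padded


# Example: a 13-column prototype and a 50-column sentence segment,
# each with 16 frequency bands, padded to 64 columns of 32 values.
n = transform_length(13, 50)
prototype = [[1.0] * 16 for _ in range(13)]
augmented = augment(prototype, n, 32)
```

With these example lengths N is 64, which agrees with the
32 x 64 augmented arrays mentioned in the text.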

Fast  Fourier  Transform  (FFT) 

Following the augmentation of both the prototype and
sentence arrays, the two-dimensional DFT of each array is
computed. The transformed functions are denoted $P_{rs}$ and
$S_{rs}$.

The complex conjugate of $P_{rs}$ ($P^{*}_{rs}$) was then formed,
and the transformed arrays were then ready for the filtering
and unit normalization process. Following the filtering and
normalization stages, the element-by-element product

$$Z_{rs} = S_{rs} P^{*}_{rs}$$

was computed.

The result of this multiplication is equivalent to
correlation in the time domain. To obtain the actual
correlation values, the inverse transform of $Z_{rs}$ was
computed, giving the array $z_{kl}$, where

k = Q, Q+1, ..., N
l = 1

(The first Q-1 values are not used due to the end effect
described in the Array Augmentation section.)

Following the inverse Fourier computation, the
correlation vector is formed by taking the first, or zero
shift, row from the z array. This row is written into the
correlation array to be used by the graphing and decision
schemes.
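
The equivalence between the transform product and time-domain
correlation can be checked on a small case. The sketch below
is one-dimensional for brevity, whereas the thesis uses the
two-dimensional transform. It uses a plain DFT rather than an
FFT, and it assumes the normalization factor 1/N is applied in
the inverse transform (the convention of the original program
is not recoverable here). All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for r in range(n):
        acc = 0j
        for k in range(n):
            acc += x[k] * cmath.exp(sign * 2j * cmath.pi * r * k / n)
        # Assumption: the 1/N factor is placed in the inverse transform.
        out.append(acc / n if inverse else acc)
    return out


def correlate_via_dft(prototype, sentence, n):
    """Correlate by multiplying S with the conjugate of P in Fourier space."""
    p = list(prototype) + [0.0] * (n - len(prototype))
    s = list(sentence) + [0.0] * (n - len(sentence))
    p_hat, s_hat = dft(p), dft(s)
    z_hat = [sv * pv.conjugate() for sv, pv in zip(s_hat, p_hat)]
    return [v.real for v in dft(z_hat, inverse=True)]


def correlate_direct(prototype, sentence):
    """Time-domain running correlation for comparison."""
    span = len(sentence) - len(prototype) + 1
    return [
        sum(prototype[t] * sentence[start + t] for t in range(len(prototype)))
        for start in range(span)
    ]


prototype = [1.0, 2.0]
sentence = [3.0, 1.0, 4.0, 1.0, 5.0]
fast = correlate_via_dft(prototype, sentence, 8)  # 8 > 2 + 5 - 1
slow = correlate_direct(prototype, sentence)
```

With the zero padding in place, the leading values of the
transform result agree with the direct computation; without
it, the wrapped-around data would corrupt them.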

Filtering 

Kabrisky (Ref 6) and Daily and Sutton (Ref 3) found that
filtering done in Fourier space helped to improve the
performance of two-dimensional pattern recognition devices.
Hensley chose to incorporate a filtering scheme into the
correlation procedure by inserting a variable window filter
into Fourier space. The rectangular filter as structured
served as a low-pass filter whose cutoff frequency was
controlled by two variables, width and length. Although the
insertion of this filter seemed to improve overall
performance, the rectangular nature of the filter seemed to have
the potential of introducing problems of leakage similar to
those listed in an earlier section. It was decided to
retain the filter as designed, but to alter the program so
that only one program change would have to be made in order
to vary the characteristics of the filter. The filter
served to remove high frequency information in the Fourier
transformed array as follows:


Figure 11. Modified Prototype Array


The  zero  section  is  a direct  function  of  the  variables 
width  and  length  which  are  themselves  functions  of  the 
single  variable  "IFILT". 

The variables width and length are formed as follows:

Width = Midth = 32-IFILT/2
Length = Mength = 64-IFILT

where

0 ≤ IFILT ≤ 64;  0 removes filter from scheme
                64 removes all Fourier information

Hensley had placed this filtering operation after
prototype unit normalization. Since the filtering operation
removed energy from the prototype, and since correlation
normalization relied upon a prototype array energy of one,
this may have introduced inconsistencies in the resulting
data. For this reason, the filtering operation was moved
so that it was performed on the prototype array after the
FFT, but before the array was unit normalized and the energy
computed. The results of the filter routine are discussed
in Appendix C.
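
The dependence of the filter on the single variable IFILT can
be sketched as below. The two formulas are those given above,
with integer division assumed for IFILT/2. Which region of the
transformed array is retained is shown only in Figure 11, so
the choice here of keeping the rectangle of lowest indices is
an assumption, as are the function names.

```python
def filter_limits(ifilt):
    """Return (width, length) of the retained region for a given IFILT."""
    if not 0 <= ifilt <= 64:
        raise ValueError("IFILT must lie between 0 and 64")
    width = 32 - ifilt // 2  # assumption: integer division
    length = 64 - ifilt
    return width, length


def apply_filter(array, ifilt):
    """Zero everything outside the retained width x length rectangle.

    `array` is indexed as array[time_index][band_index] with 64 x 32
    entries. Assumption: the retained rectangle starts at index 0.
    """
    width, length = filter_limits(ifilt)
    return [
        [v if (r < length and c < width) else 0.0 for c, v in enumerate(row)]
        for r, row in enumerate(array)
    ]


ones = [[1.0] * 32 for _ in range(64)]
kept = sum(sum(row) for row in apply_filter(ones, 16))
```

The two limiting cases stated above are reproduced: IFILT = 0
leaves the array untouched and IFILT = 64 zeroes it entirely.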


Unit  Normalization 

The  prototype  array  was  column  normalized  prior  to  the 
DFT  operation.  This  column  normalization  procedure,  as 
discussed  earlier,  insured  that  the  energy  of  the  prototype 
was  a direct  function  of  the  length  since  each  column  had 
an  energy  of  one.  Unit  normalization  was  done  to  insure 
that  each  prototype,  no  matter  what  its  length,  had  unit 
energy  prior  to  the  correlation  computation. 

The unit normalization was computed as follows:

$$\hat{P}_{ij} = \frac{P_{ij}}{(\text{Energy})^{1/2}}$$

where

$$\text{Energy} = \sum_{i} \sum_{j} (P_{ij})^{2}$$

Following these computations, the prototype vector had
an energy of one. Since the prototype had previously been
column normalized, the total energy value computed above
would be a direct function of N, the length of the prototype.
The total energy is, in fact, N, and each element is divided
by $N^{1/2}$. This will play an important part in the correlation
normalization procedure which is mentioned in the next
section.
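
A short numerical check of this statement is given below. It
is a sketch that assumes an unfiltered, column normalized
prototype stored as a list of time columns; the function name
unit_normalize is hypothetical.

```python
import math


def unit_normalize(array):
    """Divide every element by the square root of the total energy.

    Returns the normalized array together with the energy found.
    """
    energy = sum(v * v for col in array for v in col)
    scale = math.sqrt(energy)
    return [[v / scale for v in col] for col in array], energy


# Four columns, each already of unit energy (0.36 + 0.64 = 1).
prototype = [[0.6, 0.8] for _ in range(4)]
normalized, energy = unit_normalize(prototype)
new_energy = sum(v * v for col in normalized for v in col)
```

For this prototype of length 4 the energy found is 4, every
element is divided by 2, and the resulting energy is one.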

Correlation  Normalization 

The correlation normalization procedure is required in
order to have a basis for comparison between the maximum
correlation values obtained over all time for all prototypes.
Obviously, larger prototypes will incur a greater maximum
value when they encounter portions of sentence data like
themselves than will the shorter prototypes. There must be
a method of equalizing the maximum values so that the
performance of the prototypes can be compared. Since all
arrays have been column normalized, and since the prototype
arrays have been unit normalized, it is easy to see that
the maximum correlation obtainable by a prototype which
encounters an exact replica of itself in a sentence would
be $N^{1/2}$, where N is the length of the prototype. The reason
for this is illustrated below:


Figure 12. Correlation Normalization (column normalized
prototype; column + unit normalized prototype; maximum
correlation value $N/N^{1/2} = N^{1/2}$)

Since the previous section established the fact that
the energy of the prototype, prior to unit normalization, was
also N, a correlation normalization can be computed by
simply dividing the correlation values by the square root
of the energy of the prototype ($N^{1/2}$).

Parseval's Theorem states that energy is conserved
during Fourier operations. However, the nature of the
discrete Fourier Transform algorithm used alters the actual
energy in a predictable way. In this particular case, the
energy is reduced by a factor of $(N \times M)^{1/2}$, where N and M
are the dimensions of the transformed array. It is this value
which was arrived at empirically by Neyman (Ref 8) and
Hensley (Ref 5).

The computed energy of each prototype is stored as
GOOD(JP) in the computer program which performs the
correlation. As mentioned, this value is used by the program
to divide the computed correlation values to insure
that all prototypes will incur correlation values between
zero and one. In this way, the relative values of each
prototype can be compared and evaluated.
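
The normalization argument of this section can be written out
as a short worked derivation. It is a sketch which assumes an
unfiltered prototype of N columns and a sentence portion that
is an exact, column normalized replica of it. The symbol
$c_{\max}$ is a label introduced here for illustration.

```latex
\begin{aligned}
\text{Energy} &= \sum_{i}\sum_{j} P_{ij}^{2} = N
  && \text{($N$ columns, each of unit energy)} \\
\hat{P}_{ij} &= P_{ij} / N^{1/2}
  && \text{(unit normalization)} \\
c_{\max} &= \sum_{i}\sum_{j} \hat{P}_{ij}\, P_{ij}
  = N / N^{1/2} = N^{1/2}
  && \text{(match against an exact replica)} \\
c_{\max} / N^{1/2} &= 1
  && \text{(correlation normalization)}
\end{aligned}
```

A perfect match therefore scores one regardless of prototype
length, which is what allows short and long prototypes to be
compared on the same scale.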

Data Storage

Following the computation of correlation values for
all stored prototypes, the resulting array is stored in a
permanent file for evaluation by the decision scheme.
Storage in this manner allows the access of correlated data
by various modes of decision schemes and allows greater
versatility in the analysis of overall program performance.

Correlation Graph Output (Calcomp)

An  interesting  method  of  analyzing  the  results  of  the 
correlation  routine  is  to  present  the  data  in  a manner 
which  is  analogous  to  the  output  of  "matched  filters"  with 
respect  to  time.  The  matched  filters  in  this  case  are  the 
prototype  templates.  A graphing  routine  has  been  designed 
which  allows  selected  prototype  correlation  values  to  be 
sent  to  a Calcomp  Graphing  Routine.  This  routine  graphically 
depicts  the  running  correlation  of  a particular  prototype  in 
a particular  sentence. 

Figure 13 shows the output of the first four prototypes
in the 15 class problem as they were correlated with the
first sentence: "Kirk here, beam me up, Scotty." The
sentence started at time interval 20, and the word "Kirk"
appeared at time interval 78. Note the rapid rise in the
"KB" and "KE" filters at the appropriate times along with
the more gradual response of the longer "UR" filter.

Figure 13. Correlation Graph Output


VI. Decision Scheme

Following the completion of the correlation program,
the results are stored in the form of an M x N array where
M is the number of prototypes and N is the time length of
the sentence. Each element in the array represents the
correlation of a particular prototype with the sentence at
that particular time.

The array can be pictured as follows:

Figure 14. Correlation Array (rows: Prototype 1 through
Prototype M; columns: time 1 through N; entries: correlation
values)

The performance of a particular prototype throughout
the sentence is on the horizontal axis. A comparison of
all prototype correlations at one particular time is
obtained by looking at the vertical axis. It is this array
which is used for subsequent portions of the decision
scheme.

Threshold 

The array is first processed for values which are above
a selected threshold. Values equal to or greater than the
selected threshold are left unchanged; all other values are
set equal to zero. This threshold operation is easily
imagined by drawing a horizontal line on the correlation
graphs mentioned earlier. It is quite easy to gain a
preliminary idea of when a phoneme occurred by observing
the peaks which rise above this line.
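
The threshold operation can be sketched in a few lines. The
array layout (one row per prototype, one column per time
increment) follows Figure 14; the function name and the
threshold value of 0.6 are hypothetical.

```python
def apply_threshold(corr, threshold):
    """Keep values at or above the threshold; set all others to zero."""
    return [[v if v >= threshold else 0.0 for v in row] for row in corr]


corr = [[0.20, 0.60, 0.90], [0.75, 0.59, 0.10]]
kept = apply_threshold(corr, 0.6)
```

Values exactly equal to the threshold are retained, as stated
in the text.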

Endurance 

It is obvious from the graphs in Figure 13 that the
shorter prototypes have a much more irregular and peaky
correlation graph. In particular, an examination of the
prototype "KB" shows that it has numerous peaks, even
though the actual "KB" sound occurred only once in the
sentence. It was decided that a time endurance criterion
would be established to insure that "false alarms" incurred
by momentary correlation values above threshold would not
serve to make the final decision scheme any more complicated
than it already was. The endurance criterion involved
scanning the correlation array along the time axis. When a
correlation value above threshold was found, a marker was
set. When the correlation value fell below threshold, the
length of time the threshold was exceeded was checked. If
this time was below some specified percentage of the time
length of the prototype, e.g., 1/2 the length, that portion
of the array was set to zero.

Following endurance processing, the correlation array
consists of values above the desired threshold which stayed
above this threshold for an amount of time dependent on the
prototype length.
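
The endurance scan can be sketched as follows for one row of
the thresholded correlation array. This is an illustration,
not the DECIS program: the function name is hypothetical, the
default fraction of 1/2 is the example value given in the
text, and treating a run that reaches the end of the sentence
like any other run is an assumption.

```python
def apply_endurance(row, proto_len, fraction=0.5):
    """Zero out runs above threshold that are shorter than required.

    `row` holds thresholded values for one prototype (zero means
    below threshold). A run must last at least fraction * proto_len
    time increments to be kept.
    """
    min_run = fraction * proto_len
    out = list(row)
    start = None
    # A trailing zero acts as a sentinel that closes a run at the end.
    for t, v in enumerate(list(row) + [0.0]):
        if v > 0.0 and start is None:
            start = t  # marker set: threshold first exceeded here
        elif v <= 0.0 and start is not None:
            if t - start < min_run:
                for i in range(start, t):
                    out[i] = 0.0
            start = None
    return out


row = [0.0, 0.8, 0.0, 0.7, 0.9, 0.8, 0.6, 0.0]
kept = apply_endurance(row, proto_len=6)
```

Here the single-increment peak is discarded as a false alarm,
while the four-increment run survives because it exceeds half
the prototype length of six.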


Ranking 

Following the threshold and endurance processing, the
correlation array is ready for the final decision output.
At first, it seemed quite logical to simply pick the highest
correlation value at each time increment and use this as
the prototype selected. However, preliminary experimentation
with this process showed remarkable results when the
prototypes were autocorrelated (used to judge the sentence
from which they came), but poor results when the prototypes
were used with sentences from different speakers.

These results, coupled with similar results from other
research efforts, led to the conclusion that there may be
higher order decision schemes which are used by the brain to
determine actual word content of sentences following a less
than perfect recognition process. It was decided to design
a final decision scheme which would simply list a number of
choices as the representation of the speech. This type of
output had two advantages:

1. It would allow the data to be stored for further
processing by decision schemes which would take into
account higher order levels of structure such as syntax,
grammar, and context.

2. It would allow an easily understandable representation
of the relative performance of each of the prototype
arrays. Incorrect decisions could be scanned for the
position of the correct choice so as to determine methods of
producing more correct results.

The ranking program, as used in this paper, searched
the processed correlation arrays at each time segment for
the five highest correlation values. The phonemes which
incurred these values were then printed in the order of
their correlation values, from highest to lowest. This was
the final computer processing stage in the recognition
process. However, it is quite possible to store this final
decision array to be used as data for some future syntax or
grammar processing routine. The overall processing stages
of the decision scheme are shown in Figure 15. The program
which performs the described operations is called DECIS
(GS6) and is listed in the appendix.
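
The ranking stage can be sketched as below. The sketch is not
the DECIS program: the function name is hypothetical, the
array layout follows Figure 14, and the omission of prototypes
whose value was zeroed by the threshold and endurance stages
is an assumption.

```python
def rank_phonemes(corr, names, top=5):
    """List, for each time increment, the best-scoring phonemes in order.

    corr[p][t] is the processed correlation of prototype p at time t;
    names[p] is its phoneme label.
    """
    ranked = []
    for t in range(len(corr[0])):
        # Assumption: values zeroed by earlier stages are not listed.
        live = [(corr[p][t], names[p]) for p in range(len(names)) if corr[p][t] > 0.0]
        live.sort(key=lambda pair: pair[0], reverse=True)
        ranked.append([name for _, name in live[:top]])
    return ranked


names = ["KB", "UR", "KE"]
corr = [[0.9, 0.0], [0.5, 0.7], [0.0, 0.8]]
ranking = rank_phonemes(corr, names)
```

Each inner list is the ranked output for one time increment
and could be stored for a later syntax or grammar stage, as
suggested above.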

VII. Results

The results are divided into three main sections.
Section one establishes a baseline performance evaluation
for the program as originally designed by Neyman and
modified by Hensley. Section two deals with the 15-class
problem when prototypes were used from an averaged set of
data. This section also deals with the difficulties
encountered with the existing classification scheme and the
modifications which were incorporated. Section three lists
the results of an expanded 26-class problem.

15-Class  Problem  (Discrete  Prototypes) 

The first task in the modification of existing computer
algorithms is to establish a baseline performance rating so
that there is a guideline for comparison. Additionally, the
problem of differences in scoring the individual data outputs
can be lessened. This is due to the fact that both the
baseline and subsequent results can be scored using the same
data base along with the same scoring technique. Ideally,
it would have been best to use the actual prototype and
sentence data from previous efforts in order to see what
improvements could be gained by revised methods. However,
previous data sets were unavailable at the beginning of
this research.

The 15-class problem consisted of phonemes extracted
from the sentence: "Kirk here, beam me up, Scotty." The
sentence was spoken by test subject one in such a way as to
insure that each word was spoken clearly and distinctly, and
was not connected with the words surrounding it.


Analysis of this sentence yielded a set of 15
phoneme-like sounds which were chosen to represent the entire
sentence. These phonemes were extracted and cataloged
according to the phoneme extraction program as described in
Section IV. The prototypes were then used in the correlation
and decision programs as listed in the Hensley paper
(Ref 5). Scoring was done in the following manner:

1. A phoneme was "located" if it fulfilled the
specifications listed for location in the Hensley paper.

2. A phoneme was "identified" if it was printed as a
correct selection in the phonemic representation of the
sentence.

The scoring of the sentence was as follows. The score for
both location and identification consisted of a percentage
number which was derived by dividing the number of correct
choices by the total number of phonemic elements believed
to be in the sentence.

The actual program performance is listed in Table VI.
The individual sentence scores are listed in Appendix C.

Table VI

Sentence Spoken: "Kirk here, beam me up, Scotty."

Analysis

The overall performance of the recognition programs as
used by previous researchers proved to be essentially
perfect when used to autocorrelate the phonemic data. The
92% score on sentence one, speaker one, was due to an
incorrect phoneme length given to the recognition program.
When the error is taken into account, the location and
identification rate becomes 100%.


The performance degrades noticeably when the same
speaker recites the sentence at a different rate. The
performance degrades even more when different speakers were
tested. The overall identification attained by the same
speaker was 64.5%, while the identification attained for
all four speakers was 38.1%.

These results are essentially consistent with
established data in that the recognition program performs in a
competent manner only if it is "trained" with a specific
speaker for specific words. This type of performance is
inadequate for any generalized speech recognition system.

15-Class  Problem  (Averaged  Prototypes) 

The prototypes used in the first portion of the results
were then modified in the manner described in Section IV.
Representative sections of sentences spoken by speaker A
and speaker 3 were chosen to contain like phonemes. Both
the discrete and continuous sentences were used. Scoring
was accomplished in the same manner as described previously.
The results are listed in Table VII with individual
sentence performance listed in Appendix C.

Table VII

15-Class Problem, Averaged Prototypes

Analysis

As was expected, the averaging scheme lowered the
performance on sentence one, since the prototypes were no
longer formed from this sentence. However, the location and
identification scores on the other sentences show a
remarkable improvement. In the case of sentence 7, the
location rate jumped from 26.7% to 100%. Although the
identification scores also improved in a similar manner, the
overall scores were much too poor to be considered
acceptable. In addition, the location performance was
extremely difficult to tabulate due to the complex manner in
which location was determined in the unmodified programs. It
was at this point that the modifications listed in Section
V were introduced. These modifications necessitated an
amended scoring system which is discussed in the next
section.

26-Class  Problem 

The prototypes used for the 26-class problem were
extracted from the two sentences and averaged as described
in Section IV. The prototypes as used are listed in
Table V.

Preliminary scoring with the extended prototype set
revealed that the existing program decision scheme functioned
extremely poorly. The existing program, prior
to modification, yielded scores which were consistently
below 30% for all sentences. However, analysis of actual
prototype correlation values showed that the overall
correlation scheme was still functioning in an accurate manner.

It was decided to change the methods of presenting the
correlation data so that analysis and scoring would be
made easier. Section VI deals with the revised decision
scheme which was used in this section.

When the decision scheme was revised, it became
necessary to change the methods of scoring phonemic
selections.

As has been noted in Section VI, the output of the
final decision scheme is in the form of a ranked phonemic
output of the test sentence. Each time increment contains
the top five choices of the correlation program which
fulfilled the requirements of the decision scheme. See
Figure 16 for an example of the decision scheme output.

The rules governing location and identification were
modified to fit this revised decision scheme as follows. A
phoneme was considered to be located if it ranked within
the top three choices of the phonemic output. It was
labeled as identified if it was ranked as the first choice.
In this manner, there was a close similarity between the
revised ranking and the original decision output. There
is obviously still a discrepancy between the performance of
the 26-class problem when compared to the two 15-class
problems. However, the goal of this extended look into the
phoneme averaging process was to insure that the
preliminary results obtained were not just a function of the
particular data used. In addition, the improved and more
versatile decision scheme which came about because of the
problems encountered in this phase of the research will be
useful in subsequent inquiries into the nature of phonemic
speech recognition.
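
The revised scoring rules can be sketched as follows. The
sketch assumes that, for each phoneme believed to be in the
sentence, the ranked list printed at the corresponding time
increment has already been picked out by hand; that pairing,
the function name, and the sample data are hypothetical.

```python
def score_sentence(expected, rankings):
    """Return (location %, identification %) under the revised rules.

    expected[i] is the i-th phoneme believed to be in the sentence;
    rankings[i] is the ranked list printed where it should occur.
    """
    located = identified = 0
    for phoneme, ranking in zip(expected, rankings):
        if phoneme in ranking[:3]:
            located += 1  # within the top three choices
        if ranking[:1] == [phoneme]:
            identified += 1  # ranked as the first choice
    total = len(expected)
    return 100.0 * located / total, 100.0 * identified / total


expected = ["KB", "UR", "KE", "H"]
rankings = [["KB", "KE"], ["H", "UR", "I"], ["B", "P", "T", "KE"], []]
location, identification = score_sentence(expected, rankings)
```

Every identified phoneme is also located, so the location
score can never fall below the identification score under
these rules.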

The results of the 26-class problem are listed in
Tables VIII and IX.

Analysis 

The  results  of  the  26-class  problem  are  somewhat 
i;  difficult  to  interpret  due  to  the  limited  amount  of  data 

l! 

collected  along  with  the  lack  of  a suitable  baseline  with 
which  to  compare  the  revised  identification  and  location 
techniques.  However,  the  following  facts  can  be  noted. 

1.  The  location  score  for  both  15-class  and  the  26- 
class  problems  are  identical.  This  means  that  the  corre- 
lation performance  of  the  prototypes  remained  such  that  the 
correct  prototypes  still  scored  within  the  top  three 
highest  ratings.  In  essence,  the  increased  decision  space 
accorded  by  the  increased  prototype  size  does  not  seem  to 
degrade  the  overall  location  performance. 

2. The identification scores cannot be compared
directly to the previous scores due to the revised methods
for identification. All four sentences scored showed
notable improvements, but it must be remembered that an
identification meant only scoring the highest of any
prototypes located. The previous identifications were the
result of a much more restrictive scheme (Ref 5).

3. Two test sentences from the second sentence data
set were also scored. Although this is a very small data
set, the results seem to indicate that the correlation
performance remains consistent with different sentences.
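
The scoring rule described in items 1 and 2 can be sketched in a few lines. This is only an illustration, not the thesis program: the function name `score_phoneme`, the dictionary input format, and the phoneme labels are hypothetical. A phoneme is counted as located when the correct prototype is among the three highest-correlating prototypes, and as identified when it is the highest of them.

```python
def score_phoneme(correlations, correct, top_n=3):
    """Return 'X' (identified), 'L' (located), or '-' (missed).

    correlations: dict mapping prototype name -> correlation value at
                  the position being scored (hypothetical input format).
    correct:      name of the phoneme actually spoken at that position.
    """
    # Rank prototype names from highest to lowest correlation.
    ranking = sorted(correlations, key=correlations.get, reverse=True)
    if ranking[0] == correct:
        return "X"          # highest-scoring prototype is the correct one
    if correct in ranking[:top_n]:
        return "L"          # correct prototype is within the top three
    return "-"


example = {"AH": 0.81, "EE": 0.64, "S": 0.77, "M": 0.35}
print(score_phoneme(example, "AH"))  # X
print(score_phoneme(example, "EE"))  # L
print(score_phoneme(example, "M"))   # -
```

Because the top-three test does not depend on how many prototypes compete, the same rule applies unchanged to the 15-class and 26-class prototype sets.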
VIII. Conclusions


This  research  had  two  basic  goals: 

1. To evaluate existing methods of phoneme recognition
and implement improvements in the system with respect
to multiple speaker and continuous speech performance; and

2. To modify the existing recognition programs so that
they might be more versatile and serve as input to a more
sophisticated recognition system.

It is felt that both goals have been attained.

The original recognition system was acceptable when
it had as its input sentences spoken by the same person
who created the prototypes. The introduction of multiple
speakers and varying speech speeds showed that the system
was highly unreliable. The phoneme prototype averaging
idea was an attempt to move the prototype vectors closer to
some imaginary center of a hypothetical hypersphere so
that they would serve as a more universal prototype set.
The improvements were apparent in both the 15-class and the
26-class problems.

Obviously, even the 26-class prototype set is not a
complete set. It serves, however, to illustrate that it is
still possible to achieve acceptable recognition and
location scores even though the decision space has been almost
doubled. This relative insensitivity to the ill effects of an
increased decision space supports the belief that a
universal speech recognition machine, having as its first
step a correlation-based phoneme recognizer, may be
possible.

The modifications to the decision scheme are ones
which greatly enhance the versatility of the program. The
final phonemic output can be stored for further processing,
and the output as printed is suitable for side-by-side
analysis with the spectrogram data. In addition, the graph
option allows future researchers to have a pictorial
representation of the performance of each phoneme prototype.
This will allow insight into the reasons for the poor
performance of certain prototypes along with a basis for
improving the entire prototype set.

The speech recognition program, as it now exists, is
a versatile, easily changed speech phoneme analysis system.
This program can serve as the first portion of a
multi-segmented speech recognition system involving grammar,
syntax, and context programs.
IX. Recommendations


There  are  two  classes  of  recommendations  which  are 
listed  below.  Class  one  deals  with  methods  for  phoneme 
preparation,  analysis,  and  correlation.  Class  two  deals 
with  other  modifications  which  would  allow  the  user  to  have 
greater  insight  into  program  performance. 

Class  I 

1.  Use  the  spectrogram  program  (GS3)  to  produce  visual 
outputs  of  the  prepared  prototypes.  This  feedback  process 
will  enable  researchers  to  examine  the  effects  of  prototype 
averaging  to  help  form  prototype  sets  which  are  more 
clearly  distinct. 

2.  Make  future  prototype  lengths  as  short  as  possible. 
Although  shorter  prototypes  have  more  "false  alarms",  there 
seem  to  be  problems  obtaining  consistent  performance  when 
longer  prototypes  are  used. 

3.  Investigate  the  use  of  Fourier-Space  filtering  to 
help  the  decision  process.  Although  the  filter  has  been 
designed  and  tested,  no  actual  data  was  obtained  using 
filtered  results  due  to  time  limitations  and  problems  with 
the  correlation  normalization  factor.  (See  item  4 in  the 
Class  II  recommendations.) 

4. Investigate the possibilities of reducing the
frequency response of the input data set. It may not be
necessary to use all frequencies from 100 Hz to 5 kHz to
initiate correct recognition. The speech bandwidth may be
able to be further restricted without hindering overall
results.

Class  II 

1.  Investigate  the  possibilities  of  constructing 
and  using  a device  which  could  reconstruct  actual  sounds 
from  the  phonemic  renditions  of  the  decision  scheme.  The 
use  of  this  device  would  allow  the  researcher  to  obtain  an 
audio  feedback  of  what  the  correlation  program  has  chosen 
as  the  phonemic  output  of  the  sentence.  Such  a device  is 
currently  being  designed  by  these  researchers  and  shows 
great  promise. 

2. Expand the FFT subroutine in the correlation
program (GS5) so that the entire sentence can be
transformed at one time. This would greatly reduce the
computer time required by reducing the number of computations
required. It would also lessen the problems which are
incurred by incorrect correlation values near the edges of
the present sentence arrays. This problem was accepted as
a necessary evil which was offset by the advantages to be
gained by having an entire sentence correlation array
available for the final decision program.

3. Install a threshold routine which would zero
elements of the sentence array when there is no speech
present. This would serve to remove many "false alarms"
which occur when prototypes correlate with "noise".
Neyman (Ref 8) used a threshold technique similar to the
one mentioned above. It was removed by Hensley (Ref 5), and
all data in this paper was obtained without any
thresholding whatsoever. Although the overall problem did not
seem to be serious, the correlation technique is now at a
stage where small improvements can increase accuracy.

4. Investigate a revised correlation normalization
procedure which could be used when Fourier-Space filtering
is utilized. Preliminary research with the filter
indicates that it does indeed function in the desired manner.
However, the maximum correlation values are not being
limited to a maximum of one as they should be. It will be
necessary to develop an algorithm which will recompute the
correlation normalization factor as some function of the
percentage of energy removed by the filter.
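
The threshold routine proposed in Class II, item 3, might look something like the following sketch. It is not taken from the Neyman or Hensley programs: the function name `zero_silent_frames`, the use of summed band magnitude as the speech measure, and the threshold value are all assumptions made for illustration.

```python
def zero_silent_frames(sentence, threshold):
    """Zero every frame whose summed band magnitude is below `threshold`.

    sentence:  list of frames, each a list of band magnitudes
               (16 bands in the thesis data; any width works here).
    threshold: minimum summed magnitude for a frame to count as speech
               (a hypothetical tuning value).
    """
    cleaned = []
    for frame in sentence:
        if sum(frame) < threshold:
            cleaned.append([0.0] * len(frame))   # treat as "no speech"
        else:
            cleaned.append(list(frame))          # keep speech frame as is
    return cleaned


frames = [[0.1, 0.2, 0.1], [3.0, 2.5, 4.0], [0.0, 0.3, 0.1]]
print(zero_silent_frames(frames, threshold=1.0))
```

Zeroed frames contribute nothing to the correlation sum, so prototypes can no longer score against low-level background noise in those regions.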
A. Data Processing Charts and Notes


This appendix contains the flow charts which give an
overview of the operation of the seven main segments of the
speech recognition system. Also included are notes which
clarify important operating points of each program. Listed
below are the seven programs along with the associated
inputs and outputs. Subsequent pages contain the flow chart
for each program along with the notes concerning its
operation.

Table X

Data Processing Programs

Name             Input          Output
GS1 (Main)       L-Tape         PF#1
GS2 (Octave1)    PF#1           PF#2, Spectrogram
GS3 (Octave2)    PF#2           Averaged Spectrogram
GS4 (Proavg)     PF#2           PF#3 (Averaged prototypes)
GS5 (Crscor)     PF#2, PF#3     PF#4 (Correlation arrays)
GS6 (Corgph)     PF#4           Calcomp Graphs
GS7 (Decis)      PF#4           Phonemic Output

Note: PF refers to Permanent Files on the CDC system.
These files can be output as cards if the user so
desires.
GS1 (Main)


Program GS1 (Main) is used to read data from the
L-Tape which was produced by the ASD Computer Center and
write it on a permanent file (PF). This PF is used in
subsequent processing. The program attaches the L-Tape under
the name of Tape1. It reads the tape and transfers it to a
program file called Tape2. Following completion of the
transfer, the system catalogs the Tape file and gives it a
new name of SENT1. The new name is entirely the choice of


GS2 (Octave1)


GS2 (Octave1) uses the results of GS1 as input data.
The program attaches PF#1 created by GS1 and gives it a
local file name of Tape1. It then reads this file and
logarithmically compresses the data from 64 to 16 channels.
The reduced data is stored on a local file called Tape2
and is stored for use as PF#2 by subsequent portions of the
program. Two variables which are important in this program
are NREC and NN2. NREC represents the number of files to
be read. A file is one entire speech segment. NN2 should
be set to one more than the number of records in that
particular set. For example, if the ASD Center states that
the sentence data had 480 records, set NN2 to 481. This
will ensure that the data is handled correctly. Tape2 is
stored as PF#2 under the names SENT2 and SENT3.
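
The 64-to-16 channel compression can be illustrated with a short sketch. The band edges used by GS2 are not given in this appendix, so the geometric spacing below, along with the function names `log_band_edges` and `compress`, is purely an assumption chosen to show the idea: narrow bands at low frequencies and progressively wider bands at high frequencies.

```python
def log_band_edges(n_in=64, n_out=16):
    """Return n_out + 1 bin indices that split n_in bins into bands
    whose widths grow roughly geometrically with frequency."""
    ratio = n_in ** (1.0 / n_out)
    edges = [0]
    for k in range(1, n_out + 1):
        ideal = round(ratio ** k)
        # Force every band to contain at least one input bin.
        edges.append(max(ideal, edges[-1] + 1))
    edges[-1] = n_in
    return edges


def compress(spectrum, edges):
    """Sum the input channels that fall inside each band."""
    return [sum(spectrum[lo:hi]) for lo, hi in zip(edges, edges[1:])]


edges = log_band_edges()
flat = [1.0] * 64
bands = compress(flat, edges)
print(edges)   # 17 band boundaries
print(bands)   # 16 band sums; widths increase toward high frequencies
```

Summing within each band preserves the total of the input channels; averaging within each band would be an equally plausible choice and only changes the scale of the wider bands.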
GS3 (Octave2)

GS3 (Octave2) uses as its input the data stored by
GS2 (PF#2) and produces a normalized spectrogram. This
program does not create any permanent files. Its only
purpose is to produce a spectrogram which is more easily
interpreted than is the one produced by GS2 (Octave1).
Although it is necessary to read the entire sentence record
to produce the spectrogram, the two variables NSTART and
NSTOP allow the user to select only those portions of the
sentence record which are desired. The entire sentence will
be read, but only the desired portions will be output as
spectrograms.


Figure 19. Program GS3 (Octave2) Flow Diagram
GS4 (Proavg)

Program  GS4  (Proavg)  is  used  to  create  a permanent 
file  (PF#3)  of  averaged  phonemes.  This  PF  is  attached  by 
GS5  and  used  to  correlate  with  the  input  sentence  data. 

To  run  the  program,  the  sentences  from  which  the 
phonemes  are  being  made  must  be  visually  analyzed  and  the 
locations  of  the  desired  phonemes  noted.  The  program  then 
uses  these  notations  along  with  the  attached  sentence  data 
to  produce  the  prototypes.  The  location  of  each  phoneme  is 
to  be  placed  on  data  cards  which  will  be  read  into  the 
IBEGIN and IEND arrays.

The manner in which data was collected required that
every other sentence be read. This is the reason for the
skip functions within the program. Data PROREC and PRORET
are used to determine which sentence is being analyzed on
each run through the averaging portion of the program.
Data NUPROC and NUPROT are used to tell the length of each
set of prototypes being averaged. A "set" is a collection
of prototypes of the same sound.

The program first reads the IBEGIN and IEND data. The
first sentence to be used is selected. The phonemes are
read and written on Tape2 based on the values of IBEGIN and
IEND. Phonemes from like sentences are successively
selected based on PROREC and PRORET and written on Tape2.
Thus, selected groups of like phonemes are written
together on Tape2.

Tape2 is then read and each set of phonemes to be
averaged is summed and then divided by the number of
sentences (in this case, four). The averaged prototypes are
written on Tape4 and followed by an end of file statement
(EOF).

To complete the prototype set, the initial conditions
are reset. Tape1 and Tape2 are rewound and the averaging
scheme is repeated with the next group of prototypes being
written on Tape4 following the previous prototypes. Tape4
is then stored as PF#3 under the name PROAVG.
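
The sum-and-divide step can be written compactly. The sketch below is not the GS4 listing: the function name `average_prototypes` and the nested-list data layout are hypothetical, and it assumes that every example of a given phoneme has already been cut to the same number of frames.

```python
def average_prototypes(examples):
    """Average equal-length examples of one phoneme, element by element.

    examples: list of examples; each example is a list of frames and
              each frame is a list of band values.
    """
    count = len(examples)
    n_frames = len(examples[0])
    n_bands = len(examples[0][0])
    average = []
    for t in range(n_frames):
        frame = []
        for b in range(n_bands):
            # Sum the same (frame, band) cell across all examples.
            total = sum(example[t][b] for example in examples)
            frame.append(total / count)
        average.append(frame)
    return average


# Four examples (one per sentence), two frames of three bands each.
four_examples = [
    [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]],
    [[3.0, 2.0, 1.0], [0.0, 4.0, 8.0]],
    [[2.0, 2.0, 2.0], [4.0, 0.0, 4.0]],
    [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
]
print(average_prototypes(four_examples))
```

With four examples per set, the divisor is four, matching the case described above.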


Figure  20.  GS4  (Proavg)  Flow  Diagram 
GS5 (Crscor)


This program comprises the main body of the research.
GS5 (Crscor) consists of a main program (Crscor) which
inputs the necessary variables for correlation, such as
normalization desired (NORM), filtering (IFILT), prototype
lengths (ITYP), etc. Following the initialization of
desired variables, the main program calls a subroutine
(SCORR) which handles the correlation computations. The
variables are clearly documented in the program listing
found in Appendix B.¹ This program has as its output PF#4
which consists of correlation values of all prototypes over
a selected length of input data. PF#4 is named CORR. The
program listing will consist of prototype values as read in,
along with the normalized values if desired. The program
will also print out information on the subdivision of the
sentence, number of zeroes required to augment the data
arrays, and prototype length. Each phase of data
processing is clearly labeled so as to make it easier to insert
changes and revisions.


¹ The sentence data (PF#2) is attached as Tape1, while
the prototype data (PF#3) is attached as Tape6.
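
A running correlation of one prototype against a sentence can be sketched as follows. This is a direct time-domain illustration, not the FFT-based SCORR subroutine, and the normalization shown (division by the square root of the product of the prototype energy and the windowed sentence energy) is an assumption; the listing in Appendix B defines what the NORM option actually does. The function name `running_correlation` is hypothetical.

```python
import math


def running_correlation(sentence, prototype):
    """Slide `prototype` along `sentence`, returning one normalized
    correlation value per frame offset.

    Both arguments are lists of frames (lists of band values).
    """
    flat_proto = [v for frame in prototype for v in frame]
    proto_energy = sum(v * v for v in flat_proto)
    scores = []
    for start in range(len(sentence) - len(prototype) + 1):
        window = sentence[start:start + len(prototype)]
        flat_win = [v for frame in window for v in frame]
        win_energy = sum(v * v for v in flat_win)
        dot = sum(p * w for p, w in zip(flat_proto, flat_win))
        norm = math.sqrt(proto_energy * win_energy)
        # An all-zero window has no energy; score it as zero.
        scores.append(dot / norm if norm > 0.0 else 0.0)
    return scores


sentence = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 1.0]]
prototype = [[0.0, 2.0], [3.0, 1.0]]
print(running_correlation(sentence, prototype))
```

With this form of normalization no value can exceed one, and a value of one is reached only where the sentence window is a scaled copy of the prototype (offset 1 in the example).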
Figure 21. Program GS5 (Crscor) Flow Diagram (Plate 1)

Figure 23. Program GS5 (Crscor) (Plate 3)

GS6 (Corgph)

GS6 (Corgph) uses PF#4 as its input. It reads
user-selected portions of the correlation arrays into
an array called SAMPLE. These values are then sent to
special graphing routines which have been attached to the
programs through the control cards. Following the graph
calls, the resulting data is sent to the Calcomp plotter
through the use of the CALL PLOTE(N) instruction. The
output is a page listing of four correlation outputs for the
entire length of one sentence. The labels of the axes of
the graphs are controlled within GS6 and should be changed
according to which prototypes are output.

Figure 24. Program GS6 (Corgph) Flow Diagram

GS7 (Decis)


GS7 (Decis) attaches PF#4 and processes the correlation
arrays according to the methods described in Section VI.
The input arrays which contain information on the phoneme
names and lengths must be altered for each new set of
phonemes. The variables ENDUR and THRHLD are the endurance
(time) and correlation threshold values, respectively.

This program has as an output the list of the phonemic
representation of the sentence. It is possible to store
these results in permanent files if desired in order to
present these processed arrays to a future higher-order
decision scheme.
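
One plausible reading of how an endurance value and a correlation threshold could work together is sketched below. Section VI and the Decis listing are the authority on what GS7 actually does; the rule used here (a prototype is reported only where its correlation stays at or above the threshold for at least the endurance number of consecutive frames) and the function name `find_detections` are assumptions made for illustration.

```python
def find_detections(correlation, thrhld, endur):
    """Return (start, end) frame ranges where `correlation` stays at or
    above `thrhld` for at least `endur` consecutive frames."""
    detections = []
    run_start = None
    for i, value in enumerate(correlation):
        if value >= thrhld:
            if run_start is None:
                run_start = i               # a new run begins here
        else:
            if run_start is not None and i - run_start >= endur:
                detections.append((run_start, i - 1))
            run_start = None                # run ended (or never began)
    # Close a run that extends to the end of the array.
    if run_start is not None and len(correlation) - run_start >= endur:
        detections.append((run_start, len(correlation) - 1))
    return detections


corr = [0.1, 0.7, 0.8, 0.9, 0.2, 0.75, 0.3, 0.8, 0.85]
print(find_detections(corr, thrhld=0.6, endur=2))  # [(1, 3), (7, 8)]
```

In the example the single high value at frame 5 is rejected because it does not persist, which is the kind of short "false alarm" an endurance requirement is meant to suppress.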

Figure 25. Program GS7 (Decis) Flow Diagram


Figure 26. Program Main


Figure 27. Program Octave1

THIS PROGRAM ATTACHES CORRELATION ARRAYS STORED
IN PERMANENT FILES. IT THEN USES CCAUX ROUTINES TO
GRAPH THE DESIRED CORRELATION OUTPUTS. THE INFORMATION CON


CALL SCALE(SAMPLE,1.5,IEND,1)
CALL AXIS(0.,0.,2HKR,L,1.5,90.0,SAMPLE(NP1),SAMPLE(NP2))
GO TO 1240


CALL PLOT(0.,?.5,-3)
CALL SCALE(SAMPLE,1.5,IEND,1)
CALL AXIS(0.,0.,2HU9,4,1.5,90.0,SAMPLE(NP1),SAMPLE(NP2))
GO TO 1240


Figure 32. Program Decis

USE SUBROUTINE TO SORT THE PRO ARRAY (SORT)
SORT(IROW) SORT (ICOL,IROW)

C. Data Results


Notes  on  Appendix  C 

1.  Sections  in  the  scoring  row  which  contain  dashes 
were  not  scored  due  to  time  limits  within  the  computer  for 
that  particular  data  run.  All  like  sentences  were  limited 
to  the  same  number  of  possible  prototype  occurrences  so 
that  the  scoring  may  be  compared. 

2. The complete set of sentence data on the 26-class
prototype problem was not run due to difficulties in
obtaining sufficient computer running time. The final
tables containing results from the second sentence reflect
only a few runs which served to show that the overall
performance is consistent with a different sentence input.

Legend: X = Identified
        L = Located


Sentence Analysis, Speaker 1, 15-Class Problem
Averaged Prototypes


Sentence Analysis, Speaker 2, 15-Class Problem


Sentence Analysis, Speaker 4, 15-Class Problem
Averaged Prototypes


Sentence Analysis, Speaker 1, 26-Class Problem
Averaged Prototypes

D. Glossary of Technical Terms


1. Aliasing: The term "aliasing" refers to the fact that
high-frequency components of a time function can
impersonate low frequencies if the sampling rate is too low.

2. Allophone: The variant forms of a phoneme as conditioned
by position or adjoining sounds.

3. Autocorrelation: The discrete convolution of the
function x(n) with x(-n). Compute X(k), the DFT of x(n),
and multiply by X*(k). The inverse DFT of X(k)X*(k) = |X(k)|^2
corresponds to the circular convolution of x(n) with x(-n),
i.e., a circular correlation.

4. Crosscorrelation: The discrete convolution of the
function x(n) with the function y(-n). Note the above, and
that the DFT of y(-n) is Y*(k) for real y(n).

5. End Effect: The effect on computational results caused
by the periodicity imposed on a function by use of the DFT.

6. Leakage: The term "leakage" refers to the discrepancy
between the continuous and discrete Fourier transforms
caused by the required time domain truncation.

7.  Phoneme:  The  smallest  distinctive  group  or  class  of 
phones  (an  individual  speech  sound)  in  a language.  In  a 
very  general  sense,  the  phonemes  that  make  up  a speech 
sound  can  be  compared  to  the  letters  that  make  up  a written 
word. 

8. Template: The phoneme employed for matching in the
correlation program.
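
The relationship stated in glossary items 3 and 4 can be checked numerically. The sketch below is an illustration only: it uses a direct (slow) DFT rather than an FFT so that it needs nothing beyond the standard library, and the function names are hypothetical. It computes the circular crosscorrelation of two real sequences as the inverse DFT of X(k)Y*(k) and compares it with the defining sum.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (not a fast algorithm)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                    for m in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_crosscorrelation(x, y):
    """Inverse DFT of X(k) Y*(k) for real sequences of equal length."""
    X = dft(x)
    Y = dft(y)
    product = [a * b.conjugate() for a, b in zip(X, Y)]
    return [value.real for value in dft(product, inverse=True)]


def direct_crosscorrelation(x, y):
    """Same quantity from the sum r(l) = sum over n of x(n+l) y(n),
    with the index n+l taken modulo N."""
    n = len(x)
    return [sum(x[(m + lag) % n] * y[m] for m in range(n))
            for lag in range(n)]


x = [1.0, 2.0, 3.0, 0.0]
y = [0.0, 1.0, 0.0, 0.0]
print([round(v, 6) for v in circular_crosscorrelation(x, y)])
print(direct_crosscorrelation(x, y))
```

Setting y equal to x gives the autocorrelation, whose zero-lag value is the energy of x. Because the DFT treats both sequences as periodic, values wrap around the ends of the arrays (the end effect of item 5); appending zeros so that each sequence has length at least len(x) + len(y) - 1 removes the wraparound.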